The goal of this project is to create a machine learning model that can predict whether a patient will die of heart failure, based on patient history and vital signs.
Although the name suggests that the heart has stopped, this is not the case. Heart failure, also known as congestive heart failure, is a serious, incurable condition in which the heart does not work properly and fails to pump enough blood to meet the body's needs. Heart failure may occur if the heart can’t fill up with enough blood or if the heart is simply too weak to pump properly.
According to the Centers for Disease Control and Prevention, more than 6 million adults in the United States suffer from heart failure.
According to the National Heart, Lung, and Blood Institute (NHLBI), “Heart failure may not cause symptoms right away. But eventually, you may feel tired and short of breath and notice fluid buildup in your lower body, around your stomach, or your neck.” Heart failure can also eventually cause damage to other organs such as the liver or kidneys and lead to other conditions such as pulmonary hypertension, heart valve disease, and sudden cardiac arrest.
Although heart failure is incurable, the Mayo Clinic states that “Proper treatment can improve the signs and symptoms of heart failure and may help some people live longer,” and that “Lifestyle changes - such as losing weight, exercising, and managing stress - can improve your quality of life” (Staff 2021).
Although heart failure may be incurable, it could still be beneficial for medical professionals to predict whether a patient may develop and potentially die from heart failure. For example, if a doctor can determine with high probability that a patient may develop heart failure later in life, they may be able to inform the patient so that they can make lifestyle changes early enough to prevent the most significant symptoms.
Additionally, the body initially tries to compensate for heart failure through mechanisms such as enlarging the heart, developing more muscle mass, or pumping faster. These solutions are all temporary, and heart failure will simply progress until the onset of more serious symptoms such as fatigue or breathing problems. Since treatment can often slow the progression of heart failure, a machine learning model that could predict a person’s chances of dying from heart failure would help increase early detection, so that more cases are caught early and the progression of the disease is slowed.
Since the data set I will use includes deaths as a result of heart failure, creating an effective machine learning model out of this data set would also allow doctors to preemptively begin treatment that may prevent the patient from dying due to heart failure.
This data set was assembled as part of a study of heart failure patients who were admitted to the Institute of Cardiology and Allied Hospital in Faisalabad, Pakistan, between April and December 2015 (Ahmad T 2017). All patients in the study had left-ventricular systolic dysfunction, meaning that the left ventricle was unable to contract vigorously, which indicates a pumping problem (Staff 2021). Furthermore, all patients in the study fell into New York Heart Association (NYHA) Functional Classification levels III and IV.
We will begin by loading the packages used in this project and reading
the raw heart failure data into the variable
`heartfailure_data`.
# Loading in libraries we will be using
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(knitr)
library(corrplot)
library(ggthemes)
library(gt)
library(gtExtras)
library(visdat)
library(fastDummies)
tidymodels_prefer()
# Read raw data into a data frame.
heartfailure_data <- read_csv("heart_failure_clinical_records_dataset.csv")
head(heartfailure_data) %>%
gt() %>%
gt_theme_nytimes() %>%
tab_header("Heart Failure Data")
Heart Failure Data

| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 0 | 582 | 0 | 20 | 1 | 265000 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 55 | 0 | 7861 | 0 | 38 | 0 | 263358 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 65 | 0 | 146 | 0 | 20 | 0 | 162000 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 50 | 1 | 111 | 0 | 20 | 0 | 210000 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 65 | 1 | 160 | 1 | 20 | 0 | 327000 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| 90 | 1 | 47 | 0 | 40 | 1 | 204000 | 2.1 | 132 | 1 | 1 | 8 | 1 |
The data was obtained from the Kaggle data set “Heart Failure Prediction”, with the original data coming from a study conducted by Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza.

### Tidying Our Data
We can now look at some basic information about the size of our data set:
dim(heartfailure_data)
## [1] 299 13
Our data set has 299 observations and 13 variables. Let us now take a look at a summary of our variables:
vis_dat(heartfailure_data)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
## Please report the issue at <https://github.com/ropensci/visdat/issues>.
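As a numeric complement to the `vis_dat()` plot, the sketch below counts missing values per column and lists each column's class. It runs on a small stand-in data frame built from a few values in the table above, with one `NA` inserted on purpose. The names `toy`, `na_counts`, and `col_classes` are illustrative assumptions, not part of the original analysis.

```r
# Stand-in data: a few columns from the first rows of the table,
# with one NA added on purpose to show what a missing value looks like.
toy <- data.frame(
  age = c(75, 55, 65),
  anaemia = c(0, 0, NA),
  serum_creatinine = c(1.9, 1.1, 1.3)
)

# Number of missing values per column (all zeros means nothing is missing)
na_counts <- colSums(is.na(toy))
print(na_counts)

# Storage class of each column; 0/1 flags are stored as "numeric" here
col_classes <- sapply(toy, class)
print(col_classes)
```

On the real data, the same two calls would be applied to `heartfailure_data` instead of `toy`.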
We can see that our data set contains no missing values, so that is
not something we need to worry about. We also see that every column
is stored as numeric, even though `anaemia`, `diabetes`,
`high_blood_pressure`, `sex`, `smoking`, and `DEATH_EVENT` are binary.
We will convert these columns to factors:
heartfailure_data$anaemia <- as.factor(heartfailure_data$anaemia)
heartfailure_data$diabetes <- as.factor(heartfailure_data$diabetes)
heartfailure_data$high_blood_pressure <- as.factor(heartfailure_data$high_blood_pressure)
heartfailure_data$sex <- as.factor(heartfailure_data$sex)
heartfailure_data$smoking <- as.factor(heartfailure_data$smoking)
heartfailure_data$DEATH_EVENT <- as.factor(heartfailure_data$DEATH_EVENT)
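The six assignments above could also be written as a single step over a vector of column names. The following is a minimal base R sketch on a stand-in data frame; `toy` and `binary_cols` are hypothetical names. It uses `factor()` with explicit `levels` rather than `as.factor()`, an assumption made here so that both levels exist even when a column happens to contain only one of them.

```r
# Stand-in data using values from the first rows of the table
toy <- data.frame(
  age = c(75, 55, 65, 50),
  anaemia = c(0, 0, 0, 1),
  diabetes = c(0, 0, 0, 0),
  DEATH_EVENT = c(1, 1, 1, 1)
)

# Columns that hold 0/1 flags rather than measurements
binary_cols <- c("anaemia", "diabetes", "DEATH_EVENT")

# Convert all flag columns in one step, keeping both levels (0 and 1)
toy[binary_cols] <- lapply(toy[binary_cols], factor, levels = c(0, 1))

str(toy)
```

With dplyr loaded, `mutate(across(all_of(binary_cols), as.factor))` is a similar route that stays within the tidyverse style used in the rest of this project.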
When looking at information about the original data set, I also
noticed that `time` indicates either the number of days until
the patient died or the number of days until the patient was censored,
which in this case simply means that they did not die. Because of this,
I decided that `time` would not only be hard to interpret but would
also be irrelevant to whether the patient actually died of heart
failure, so I will remove it from the data set used for the machine
learning models.
heartfailure_data <- heartfailure_data %>% select(-time)
Now we are left with the following variables, which will be used
for the machine learning models:

- `age`: The age of the patient.
- `anaemia`: 1 if the patient was anemic (haematocrit level lower than 36%), 0 if the patient was not anemic.
- `creatinine_phosphokinase`: The level of creatine phosphokinase (CPK) in the blood. CPK is often released into the blood when muscle tissue gets damaged.
- `diabetes`: 1 if the patient has diabetes, 0 if the patient does not have diabetes.
- `ejection_fraction`: The percentage of blood the left ventricle pumps out with each contraction.
- `high_blood_pressure`: 1 if the patient has high blood pressure, 0 if the patient does not.
- `platelets`: Result of a platelet count, which measures the number of platelets in the blood.
- `serum_creatinine`: Creatinine level in the blood. High serum creatinine levels indicate that the kidneys may not be functioning properly (Roth 2019).
- `serum_sodium`: Result of a blood sodium test. Low serum sodium levels may be an indicator of heart failure (Case-Lo 2018).
- `sex`: 1 if the patient is male, 0 if the patient is female.
- `smoking`: 1 if the patient smokes, 0 if the patient does not smoke.
- `DEATH_EVENT`: 1 if the patient died during the course of the study, 0 if the patient did not die during the course of the study.
We will first look at the distribution of heart failure deaths:
# DEATH_EVENT is already a factor; set the bar colour directly on the geom
ggplot(heartfailure_data, aes(x = DEATH_EVENT)) +
  geom_bar(fill = "#69b3a2") +
  labs(title = "Distribution of Heart Failure Deaths", x = "Death Event", y = "Count")
From the bar chart, we see that most of the patients did not die
during the study. In fact, of the 299 patients, 32.1% died and
67.9% did not.
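The percentages can be reproduced with `table()` and `prop.table()`. The sketch below is an illustration only: the counts 96 and 203 are inferred from the percentages quoted above (32.1% and 67.9% of 299) rather than read from the data, and `death_event`, `counts`, and `percent` are hypothetical names.

```r
# Counts inferred from the quoted percentages (an assumption):
# 96 deaths and 203 survivors out of 299 patients.
death_event <- factor(c(rep(1, 96), rep(0, 203)), levels = c(0, 1))

# Absolute counts per class
counts <- table(death_event)
print(counts)

# Share of each class in percent, rounded to one decimal place
percent <- round(100 * prop.table(counts), 1)
print(percent)
```

On the real data, `prop.table(table(heartfailure_data$DEATH_EVENT))` would give the same shares directly. The roughly one-third to two-thirds split is worth keeping in mind when choosing evaluation metrics later.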
# Density of age, split by outcome (ggplot2 layers are combined with `+`, not `%>%`)
ggplot(data = heartfailure_data, aes(x = age, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_density(adjust = 1.5, alpha = .4) +
  scale_fill_manual(labels = c("Patient Did Not Die", "Patient Died"), values = c("lightblue", "pink")) +
  labs(title = "Distribution of Patients who Lived/Died during Study", x = "Age", y = "Density")
# Histogram of age, split by outcome
ggplot(data = heartfailure_data, aes(x = age, group = DEATH_EVENT, fill = DEATH_EVENT)) +
  geom_histogram(bins = 39) +
  scale_fill_manual(labels = c("Patient Did Not Die", "Patient Died"), values = c("lightblue", "pink")) +
  labs(title = "Distribution of Patients who Lived/Died during Study", x = "Age", y = "Count")
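A numeric summary can back up what the two age plots show visually. The sketch below computes the median and mean age within each outcome group. The ages and outcomes in `toy` are invented purely for illustration and say nothing about the real data; on the actual data set the same calls would use `heartfailure_data$age` and `heartfailure_data$DEATH_EVENT`.

```r
# Hypothetical ages and outcomes, invented purely for illustration
toy <- data.frame(
  age = c(75, 55, 65, 50, 60, 45, 70, 52),
  DEATH_EVENT = factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
)

# Median and mean age within each outcome group (0 = survived, 1 = died)
age_median <- tapply(toy$age, toy$DEATH_EVENT, median)
age_mean <- tapply(toy$age, toy$DEATH_EVENT, mean)

# One row per statistic, one column per outcome group
print(rbind(median = age_median, mean = age_mean))
```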